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1 Introduction 

This chapter provides a overview of Bayesian inference, mostly emphasising that it is a 
universal method for summarising uncertainty and making estimates and p redictions usin g 
probability statements conditional on observed data and an assumed model dGelmanl 2008). 
The Bayesian perspective is thus applicable to all aspects of statistical inference, while being 
open to the incorporation of information items resulting from earlier experiments and from 
expert opinions. 

We provide here the basic elements of Bayesian analysis when considered for standard 
models, refering to IMarin and Robert! d2007l) and to lRobertl J2007I) for book-length entries. 1 
In the following, we ref rain from emb arking upon philosophical discussions about the nature 
of knowledge (see, e.g., Robert 2007, Chapter 10), opting instead for a mathematically sound 
presentation of an eminently practical statistical methodology. We indeed believe that the 
most convincing arguments for adopting a Bayesian version of data analyses are in the 
versatility of this tool and in the large range of exi sting applicat ions , rather than in those 
polemical arguments (for such perspectives, see, e.g., |javnesll2003l and lMacKavll2002h . 



2 The Bayesian argument 

2. 1 Bases 



We start this section with some notations about the statistical model that may appear to be 
over-mathematical but are nonetheless essential. 

Given an independent and identically distributed (iid) sample Q) n = (x\, . . . , x n ) from a 
density fg, with an unknown parameter 9 € G, like the mean /i of the benchmark normal 



'The chapter borrows heavily from Chapter 2 of Marin and Robert (2007). 
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distribution, the associated likelihood function is 

n 

t(9\9 n ) = llM 



i=l 



This quantity is a fundamental entity for the analysis of the information provided about the 
parameter 9 by the sample and Bayesian analysis relies on this function to draw inference 
on 9. 2 Since all models are ap proximations of re ality, the choice of a sampling model is wide- 



open for criticisms (see, e.g.. lTempletonl l2008), but those criticism go far beyond Bayesian 



modelling and question the relevance of completely built models for drawing inference or 
running predictions. We will therefore address the issue of model assessment later in the 
chapter. 

The major input of the Bayesian perspective, when compared with a standard likelihood 
approach, is that it modifies the likelihood — which is a simple function of 9 — into a posterior 
distribution on the parameter 9 — which is a probability distribution on defined by 



v 1 n ' f £(0\@ n )ir(6)d6' 

The factor ir{9) in (Q} is called the prior (often omitting the qualificative density) and it 
necesssarily has to be determined to start the analysis. A primary motivation for introducing 
this extra-factor is that the prior distribution summarizes the prior information on 9; that is, 
the knowledge that is available on 9 prior to the observation of the sample 3> n . However, the 
choice of n(9) is often decided on practical or computational grounds rather than on strong 
subjective beliefs or on overwhelming prior information. As will be discussed later, there 
also exist less subjective choices, made of families of so-called noninformative priors. 

The radical idea behind Bayesian modelling is thus that the uncertainty on the unknown 
parameter 9 is more efficiently modelled as randomness and consequently that the probability 
distribution it is needed on as a reference measure. In particular, the distribution Pg of the 
sample then takes the meaning of a probability distribution on *3l n that is conditional 
on [the event that the parameter takes] the value 9, i.e. f$ is the conditional density of x 
given 9. The above likelihood offers the dual interpretation of the probability density of & n 
conditional on the parameter 9, with the additional indication that the observations in @ n are 
independent given 9. The numerator of (Q]) is therefore the joint density on the pair (J2> ni 9) 
and the (standard probability calculus) Bayes theorem provides the conditional (or posterior) 
distribution of the parameter 9 given the sample @ n as ([]]), the denominator being called the 
marginal (likelihood) m(@ n ). 

There are many arguments which make such an approach compelling. When defining 
a probability measure on the parameter space 0, the Bayesian approach endows notions 
such as the probability that 9 belongs to a specific region with a proper meaning and those 
are particularly relevant when designing measures of uncertainty like confidence regions or 
when testing hypotheses. Furthermore, the posterior distribution ([TJ can be interpreted as the 
actualisation of the knowledge (uncertainty) on the parameter after observing the data. At this 
early stage, we stress that the Bayesian perspective does not state that the model within which 
it operates is the "truth", no more that it believes that the corresponding prior distribution n 



2 Resorting to an abuse of notations, we will also call £(9\&) n ) our statistical model, even though the distribution 
with the density fg is, strictly speaking, the true statistical model. 
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it requires has a connection with the "true" production of parameters (since there may even 
be no parameter at all). It simply provides an inferential machine that has strong optimality 
properties under the right model and that can similarly be evaluated under any other well- 
defined alternative model. Furthermore, t he Bayesian approach includes techniques to check 
prior beliefs as well as statistical models (IGelmanl2008h . so there seems to be little reason for 
not using a given model at an earlier stage even when dismissing it as "un-true" later (always 
in favour of another model). 



2.2 Bayesian analysis in action 

The operating concept that is at the core of Bayesian analysis is that one should provide an 
inferential assessment conditional on the realized value of ' 9> n , and Bayesian analysis gives a 
proper probabilistic meaning to this conditioning by allocating to 9 a (reference) probability 
(prior) distribution tt. Once the prior distribution is selected, Bayesian inference formally 
is "over"; that is, it is completely determined since the estimation, testing, prediction, 
evaluation, and any other inferential procedures are automatically provided by the prior and 
the associated loss (or penalty) function. 3 For instance, if estimations 9 of 9 are evaluated via 
the quadratic loss function 

L{9,§) = \\e-ef, 

the corresponding Bayes procedure is the expected value of 9 under the posterior distribution, 

for a given sample & n . For instance, observing a frequency 38/58 of survivals among 58 
breast-cancer patients and assuming a binomial £>(58, 9) with a uniform U(0, 1) prior on 9 
leads to the Bayes estimate 



f>(y(i-g) 2 °dg _ 38 + i 

£ (|)038 ( i -9)*° 66 58 + 2 



since the posterior distribution is then a beta Be(38 + 1, 20 + 1) distribution. 

When no specific loss function is available, the estimator (0 is often used as a default 
estimator, although alternatives also are available. For instance, the maximum a posteriori 
estimator (MAP) is defined as 

9 = argmax7r(0|^ n ) = axgmaxTr(9)£(9\@ n ), (3) 

6 9 

where the function to maximize is usually provided in closed form. However, numerical 
problems often make the optimization involved in finding the MAP far from trivial. Note 
also here the similarity of (0 with the maximum likelihood estimator (MLE): The influence 
of the prior distribution n(9) progressively disappears with the number of observations , and 



the MAP estimator recovers the asymptotic properties of the MLE. See ISchervishl d 1 9951) for 
more details on the asymptotics of Bayesian estimators. 



3 Hence the concept, introduced above, of a complete inferential machine. 
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Figure 1 (left) Data describing the survival rates of some breast-cancer patients (Bisho p etal]|l975h 
and (right) representation of two gamma posterior distributions differentiating between malignant 
(dashes) versus non-malignant (full) breast cancer survival rates. 



As an academic example, consider the contingency table provided in Figure Q] on 
survival rate for br east-cancer patients with or without malignant tumours, extracted from 
Bish op et al.1 d 1975b . the goal being to distinguish between the two types of tumour in terms 
of survival probability. We then consider each entry of the table on the number of survivors 
(first column of figures in Figure Q] to be independently Poisson distributed V{Nit6i), 
where t = 1,2,3, denotes the age group, i = 1, 2, the tumor group, distinguishing between 
malignant (i — 1) and non-malignant (i — 2), and Nu is the total number of patients in this 
age group and for this type of tumor. Therefore, denoting by xn the number of survivors in 
age group t and tumor group i, the corresponding density is 

The corresponding likelihood on 6i (i = 1, 2) is thus 

3 

t=i 

which, under an 9 ~ £xp(2) prior, leads to the posterior 

jr(0i|Z> 3 ) oc e x " +x * i+X3i exp{-0i(2 + N u + N 2l + N 3i )} 

i.e. a Gamma T(xu + x 2 i + x-si + 1,2 + Nu + N 2 i + N^i) distribution. The choice of the 
(prior) exponential parameter corresponds to a prior estimate of 50% survival probability 
over the period. 4 In the case of the non-malignant breast cancers, the parameters of the 



4 Once again, this is an academic example. The prior survival probability would need to be assessed by a physician 
in a real life situation. 
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(posterior) Gamma distribution are a — 136 and b — 161, while, for the malignant cancers, 
they are a — 96 and b = 133. Figure[T]shows the difference between both posteriors, the non- 
malignant case being stochastically closer to 1, hence indicating a higher survival rate. (Note 
that the posterior in this figure gives some weight to values of larger than 1. This drawback 
can easily be fixed by truncated the exponential prior at 1.) 



2.3 Prior distributions 

The selection of the prior distribution is an important issue in Bayesian modelling. When 
prior information is available about the data or the model, it can be used in building the 
prior, and we will see some illustrations of this recommendation in the following chapters. 
In many situations, however, the selection of the prior distribution is quite delicate in the 
absence of reliable prior information, and generic solutions must be chosen instead. Since 
the choice of the prior distribution has a considerable influence on the resulting inference, 
this choice must be conducted with the utmost care. It is indeed straightforward to come up 
with examples where a particular choice of the prior leads to absurd decisions. Hence, for a 
Bayesian analysis to be sound the prior distribution needs to be well-justified. Before entering 
into a brief description of some existing approaches of constructing prior distributions, note 
that, as part of model checking, every Bayesian analysis needs to assess the influence of the 
choice of the prior, for instance through a sensitivity analysis. Since the prior distribution 
models the knowledge (or uncertainty) prior to the observation of the data, the sparser the 
prior information is, the flatter the prior should be. There actually exists a category of priors 
whose primary aim is to minimize the impact of the prior selection on the inference: They 
are called noninformative priors and we will detail them below. 

When the sample model is from an exponential family of distributions 5 with densities of 
the form 

fe{x) = h{x) exp {9 ■ R(x) - *(0)} , 9, R(x) € « p , 

where 9 ■ R(x) denotes the canonical scalar product in R p , there exists an associated class of 
priors called the class of conjugate priors, of the form 



tt(0|£,A) ocexp{0-£- Atf(0)} , 

which are parameterized by two quantities, A > and £, A£ being of the same nature as R(y). 
These parameterized prior distributions on are appealing for the simple computational 
reason that the posterior distributions are exactly of the same form as the prior distributions; 
that is, they can be written as 

Tr(6\e(V n ),\'(D n )), (4) 



where (£'(T> n ), X'(T> n )) is defined in terms of the sample of observations T> n ( Robertll2007 



Section 3.3.3). Equation (0]i simply says that the conjugate prior is such that the prior and 
posterior densities belong to the same parametric family of densities but with different 
parameters. In this conjugate setting, it is the parameters of the posterior density themselves 
that are "updated", based on the observations, relative to the prior parameters, instead of 
changing the whole shape of the distribution. To avoid confusion, the parameters involved in 



This covers most of the standard statistical distributions, see Lehmann and Casellal 1 19981) or lRoberJ <2007n . 



6 B AYES IAN INFERENCE 



the prior distribution on the model parameter are usually called hyperparameters. (They can 
themselves be associated with prior distributions, then called hyperpriors.) 

The computation of estimators, of confidence regions or of other types of summaries of 
interest on the conjugate posterior distribution often becomes straightforward. 

As a first illustration, note that a conjugate family of priors for the Poisson model is the 
collection of gamma distributions T(a, b), since 

fe(x)ir{9\a,b) ot8 a - 1+x e-<< b+1)e 

leads to the posterior distribution of 8 given X — x being the gamma distribution Qa(a + 
x, b + 1). (Note that this includes the exponential distribution £xp(2) used on the dataset of 
Figure Q] The Bayesian estimator of the average survival rate, associated with the quadratic 
loss, is then given by 8 = (1 + x\ + X2 + x^)/{2 + N\ + N-2 + N3), the posterior mean.) 

As a further illustration, consider the case of the normal distribution t/V([i, 1), which is 
indeed another case of an exponential family, with 8 = /j,, R(x) = x, and ^(/i) = /i 2 /2. The 
corresponding conjugate prior for the normal mean /1 is thus normal, 

^(A-^A" 1 ) . 

This means that, when choosing a conjugate prior in a normal setting, one has to select both 
a mean and a variance a priori. (In some sense, this is the advantage of using a conjugate 
prior, namely that one has to select only a few parameters to determine the prior distribution. 
Conversely, the drawback of conjugate priors is that the information known a priori on /i 
either may be insufficient to determine both parameters or may be incompatible with the 
structure imposed by conjugacy.) Once £ and A are selected, the posterior distribution on /j, 
for a single observation x is determined by Bayes' theorem, 

Tr(fi\x) oc exp(xfi — p, /2) exp(£/i — A/it 2 /2) 

cx exp {-(1 + A) [ M - (1 + A)- 1 ^ + 0] 2 12} , 

i.e. a normal distribution with mean (1 + A) _1 (x + ^) and variance (1 + A) -1 . An 
alternative representation of the posterior mean is 

A" 1 1 



1 + A- 1 1 + A- 



1 A £ , (5) 



that is, a weighted average of the observation x and the prior mean A -1 ^. The smaller A is, 
the closer the posterior mean is to x. The general case of an iid sample 2l n — (x\, . . . , x n ) 
from the normal distribution ^V{p, 1) is processed in exactly the same manner, since x„ is a 
sufficient statistic with normal distribution jV{p, 1/ra): the l's in (0 are then replaced with 

n _1 's. 

The general case of an iid sample & n = (x\, . . . ,x n ) from the normal distribution 
jV (/x, a 2 ) with an unknown 8 = a 2 ) also allows for a conjugate processing. The normal 
distribution does indeed remain an exponential family when both parameters are unknown. 
It is of the form 

(a 2 )-^/2 exp {-(A>-0 2 +«) /2a 2 } 
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since 

7r((/i, a 2 )|^„) oc (a 2 )-^" 3 / 2 exp {- (A> - £) 2 + a) /2a 2 } 

x(a 2 )-"/ 2 cxp{-(n(fi-x) 2 + s 2 ) /2a 2 } (6) 

oc (a 2 )-^^") exp {- (X^ n )(p - t(®n)) 2 + a(9 n )) /2a 2 } , 

where s 2 = Y27=i ( Xi ~ x ) 2 - Therefore, the conjugate prior on is the product of an inverse 
gamma distribution on a 2 , ^5f(A CT , a/2), and, conditionally on a 2 , a normal distribution on 

/^(e,a 2 /A M ). 

The apparent simplicity of conjugate priors is however not a reason that makes them 
altogether appealing, since there is no further (strong) justification to their use. One of the 
difficulties with such families of priors is the influence of the hyperparameter (£, A). If the 
prior information is not rich enough to justify a specific value of (£, A), arbitrarily fixing 
(£, A) = (£o, Ao) is problematic, since it does not take into account the prior uncertainty on 
(£oj Ao) itself. To improve on this aspect of conjugate priors, a more ameanable solution is to 
consider a hierarchical prior, i.e. to assume that 7 = (£, A) itself is random and to consider 
a probability distribution with density q on 7, leading to 

0|7~7r(0| 7 ) 
7 - 5(7) , 

as a joint prior on (8, 7). The above is equivalent to considering, as a prior on 9 

tt(0) = 7r(0| 7 )g(7)d 7 . 

As a general principle, q may also depend on some further hyperparameters 77. Higher order 
levels in the hierarchy are thus possible, even though the influence of the hyper(-hyper- 
)parameter 77 on the posterior distribution of 6 is usually smaller than that of 7. But multiple 
levels are nonetheless usefu l in complex populations as those found in animal breeding 
( S0rensen and Gianolall2002h . 



Instead of using conjugate priors, even wh en mixed with hyp erpriors, one can opt for 



the so-called noninformative (or vague) priors dRobert et alj|2009l) in order to attenuate the 



impact on the resulting inference. These priors are defined as refinements of the uniform 
distribution, which rigorously does not exist on unbounded spaces. A peculiarity of those 
vague priors is indeed that their density usually fails to integrate to one since they have 
infinite mass, i.e. 

/ 7T(0)d0 = +OO, 

Je 

and they are defined instead as positive measures, the first and foremost example being 
the Lebesgue measure on W. While this sounds like an invalid extension of the standard 
probabilistic framework — leading to their denomination of improper priors — , it is quite 
correct to define the corresponding posterior distributions by ([T), provided the integral in 
the denominator is defined, i.e. 

tt(8)1(9\®„) 66 = m{St n ) < 00 . 
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In some cases, this difficulty disappears when the sample size n is large enough. In others like 
mixture models (see also Section [3j, the impossibility of using a particular improper prior 
may remain whatever the sample size is. It is thus strongly advised, when using improper 
priors in new settings, to check that the above finiteness condition holds. 

The purpose of noninformative priors is to set a prior reference that has very little bearing 
on the inference (relative to the information b rought by the likelihood function). More 



detailed accounts are provided in iRobertl (12007[ Section 1.5) about this possibility of using 



(T-finite measures in settings where genuine probability prior distributions are too difficult to 
come by or too subjective to be accepted by all. 

While a seemingly natural way of constructing noninformative priors would be to fall 
back on the uniform (i.e. flat) prior, this solution has many drawbacks, the worst one being 
that it is not invariant under a change of parameterisation. To understand this issue, consider 
the example of a Binomial model: the observation x is a B(n,p) random variable, with 
p € (0, 1) unknown. The uniform prior ir(p) = 1 could then sound like the most natural 
noninformative choice; however, if, instead of the mean parameterisation by p, one considers 
the logistic parameterisation 9 = \og(p/ (1 — p)) then the uniform prior on p is transformed 
into the logistic density 

tt(0) =e e /(l + e e ) 2 

by the Jacobian transform, which is not uniform. There is therefore a lack of invariance under 
reparameterisation, which in its turn implies that the choice of the parameterisation associated 
with the uniform prior is influencing the resulting posterior. This is generaly considered to 
be a drawback. Flat priors are therefore mostly restricted to location models x ~ p(x — 9), 
while scale models 

x ~ p{x/9)/9 

are associated with the log-transform of a flat prior, that is, 

tt(9) = 1/9. 

In a more gene ral sett ing, the (noninformative) prior favoured by most Bayesians is the so- 
called Jeffreys' d 19391) prior, which is related to Fisher's information I F (9) by 

k j {9) = \I f {9)\ 1 ' 2 , 

where |/| denotes the determinant of the matrix /. 

Since the mean p of a normal model Af(p, 1) is a location parameter, the standard 
choice of noninformative prior is then 7r(/i) = 1 (or any other constant). Given that this 
flat prior formally corresponds to the choice A = // = in the conjugate prior, it is easy to 
verify that this noninformative prior is associated with the posterior distribution j¥(x, 1). 
An interesting consequence of this remark is that the posterior density (as a function 
of the parameter 9) is then equal to the likelihood function, which shows that Bayesian 
analysis subsumes likelihood analysis in this sense. Therefore, the MAP estimator is also 
the maximum likelihood estimator in that special case. Figure [2] provides the posterior 
distributions associated with both the flat prior on p and the conjugate JV{0, 0.1 a 2 ) prior 
for a crime dataset discussed in Marin and Robert d2007h . The difference between both 



posteriors is still visible after 90 observations and it illustrates the impact of the choice of 
the hyperparameter (£, A) on the resulting inference. 
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Figure 2 Two posterior distributions on a normal mean corres ponding to the flat prior ( plain) and a 
conjugate prior (dotted) for a dataset of 90 observations. (Source: Marin and Robert 2007.) 

2.4 Confidence intervals 

As should now be clear, the Bayesian approach is a complete inferential approach. Therefore, 
it covers among other things confidence evaluation, testing, prediction, model checking, and 
point estimation. Unsurprisingly, the derivation of the confidence intervals (or of confidence 
regions in more general settings) is based on the posterior distribution ir(6\@ n ). Since the 
Bayesian approach processes 8 as a random variable and conditions upon the observables 
2# n , a natural definition of a confidence region on 8 is to determine C(@ n ) such that 

TT(eeC(2> n )\@ n ) = 1-Q (7) 

where a is either a predetermined level such as 0.05. 6 or a value derived from the loss 
function (that may depend on the data). 

The important difference from a traditional perspective is that the integration here is done 
over the parameter space, rather than over the observation space. The quantity 1 — a thus 
corresponds to the probability that a random 8 belongs to this set C(@ n ), rather than to 
the probability that the random set contains the "true" value of 8. Given this drift in the 
interpretation of a confidence set (rather called a credible set by Bayesians in order to 
stress this major difference with the classical confidence set), the determination of the best 7 
confidence set turns out to be easier than in the classical sense: It simply corresponds to the 
values of 8 with the highest posterior values, 

C{9 n ) = {8; ir(8\® n ) > k a } , 

where k a is determined by the coverage constraint (0. This region is called the highest 
posterior density (HPD) region. 

When the prior distribution is not conjugate, the posterior distribution is not necessarily so 
easily-managed. For instance, if the normal -^(/i, 1) distribution is replaced with the Cauchy 

'There is nothing special about 0.05 when compared with, say, 0.87 or 0.12. It is just that the famous 5% level 
is adopted by most as an acceptable level of error. 

7 In the sense of offering a given confidence coverage for the smallest possible length/volume. 
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Figure 3 (left) Posterior distribution of the locati on parameter fj, of a Cauc hy sample for a ,jV(0, 10) 
prior and corresponding 95% HPD region (Source: iMarin and Robertll2007h : (right) Representation of 
a posterior sample of 10 3 values of (6, a 2 ) for the normal model, x\ , . . . , x\o ~ N{9, a 2 ) with x = 0, 
s 2 = 1 and n = 10, under Jeff reys' prior, along with th e pointwise approximation to the 10% HPD 
region (in darker hues) (Source: iRobert arid Wraith 200^). 



distribution, ^(/i, 1), in the likelihood 

n I n 

= n m**) = 1 h n + - ^) 2 ) > 

1=1 ' 1=1 

there is no conjugate prior available and we can consider a normal prior on [i, say ^(0, 10). 
The posterior distribution is then proportional to 

/ " 

fr(A*|#n) - cxp(- M 2 /20) /nC 1 + fa - M) 2 ) • 

Solving 7r(/i|^„) = fc is not possible analytically, only numerically, and the derivation of the 
bound k a requires some amount of trial-and-error in order to obtain the correct coverage. 
Figure [3] gives the posterior distribution of /i for the observations x\ — —4.3 and x-i = 3.2. 
For a given value of k, a trapezoidal approximation can be used to compute the approximate 
coverage of the HPD region. For a = 0.95, a trial-and-error exploration of a range of values 
of k then leads to an approximation of k a — 0.0415 and the corresponding HPD region is 
represented in Figure[3](f/e/fJ. 

As illustrated in the above example, posterior distributions are not necessarily unimodal 
and thus the HPD regions may include several disconnected sets. This may sound 
counterintuitive from a classical point of view, but it must be interpreted as indicating 
indeterminacy, either in the data or in the prior, about the possible values of 8. Note also that 
HPD regions are dependent on the choice of the reference measure that defines the volume 
(or surface). 



B AYES IAN INFERENCE 11 



The analytic derivation of HPD regions is rarely straightforward but let us stress that, due to 
the fact that the posterior density is most known up to a normalising constant, those regions 
can be easily derived from posterior simulations. For instance, Figure [3] (right) illustrates 
this derivation in the case of a normal Af(9,a 2 ) model with both parameters unknown 
and Jeffreys' prior, when the sufficient statistics are x — and s 2 = 1, based on n = 10 
observations. 



3 Testing Hypotheses 

Deciding about the validity of some restrictions on the parameter 9 or on the validity of 
a whole model — like whether or not the normal distribution is appropriate for the data at 
hand — is a major and maybe the most important component of statistical inference. Because 
the outcome of the decision process is clearcut, accept (coded by 1) or reject (coded by 0), 
the construction and the evaluation of procedures in this setup are quite crucial. While the 
Bayesian solution is formally very close to a likelihood ratio statistic, its numerical values 
and hence its conclusions often strongly differ from the classical solutions. 



3. 1 Decisions 

Without loss of generality, and including the setup of model choice, we represent null 
hypotheses as restricted parameter spaces, namely 9 £ Oo- For instance, 9 > corresponds 
to 9o = K + . The evaluation of testing procedures can be formalised via the — 1 loss that 
equally penalizes all errors: If we consider the test of Hq : 9 € Oo versus H i : £ Oo, 
and denote by d € {0, 1} the decision made by the researcher and by 5 the corresponding 
decision procedure, the loss 



L(6,d) 

is associated with the Bayes decision (estimator) 

6*(x) 



l-d if 9 e e . 

d otherwise, 



1 if F*(0 e &o\x) > P 7T (9 & e \x), 
otherwise. 



This estimator is easily justified on an intuitive basis since it chooses the hypothesis with the 
largest posterior probability. The Bayesian testing procedure is therefore a direct transform 
of the posterior probability of the null hypothesis. 



3.2 The Bayes Factor 

A notion central to Bayesian testing is the Bayes factor 

w = P ff (fleei|aQ/P*(fl€ e |a:) 

10 p*(9e e 1 )/p^{9€ e ) ' 

which corresponds to the classical odds or likelihood ratio, the difference being that the 
parameters are integrated rather than maximized under each model. While it is a simple one- 
to-one transform of the posterior probability, it can be used for Bayesian testing without 
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resorting to a specific loss, evaluating th e strength of th e evidence in favour or against Hq 
by the distance of \og 10 (B^ Q ) from zero (|Jeffrevslll939l) . This somehow ad-hoc perspective 
provides a reference for hypothesis assessment with no need to define the prior probabilities 
of Hq and Hi, which is one of the advantages of using the Bayes factor. In general, the Bayes 
factor does depend on prior information, but it can be perceived as a Bayesian likelihood 
ratio since, if ttq and m are the prior distributions under Hq and Hi, respectively, _Bf can 
be written as 

thus replacing the likelihoods with the marginals under both hypotheses. Thus, by integrating 
out the parameters within each hypothesis, the uncertainty on each parameter is ta ken into 
accoun t, which induces a natural penalisation for larger models, as intuited by Ijeffrevsl 
( 1939|). The Bayes factor is connected with the Bayesian information criterion (BIC, see 
Robert] E007[ Chapter 5), with a penalty term of the form cilogn/2, which explicits the 
penalisation induced by Bayes factors in regular parametric models. In a wide generality, 
the Bayes factor asymptotically corresponds to a likelihood ratio with a penalty of the form 
d* logn*/2 where d* and n* can be vie wed as the effective dimension of the model an d 
number of observations, respectively, see (IBerger et al.ll2003UChambaz and Ro usseau 2008). 
The Bayes factor therefore offers the major interest that it does not require to compute a 
complexity measure (or penalty term) — in other words, to define what is d* and what is 
n* — , which often is quite complicated and may depend on the true distribution. 

3.3 Point null hypotheses 

When the hypothesis to be tested is a point null hypothesis, Hq : 8 = 8q, there are difficulties 
in the construction of the Bayesian procedure, given that, for an absolutely continuous prior 

71", 

P*{6 = 6> ) = . 

Rather logically, point null hypotheses can be criticized as being artificial and impossible 
to test {how often can one distinguish 8 — from 8 = 0.0001?/), but they must also 
be processed, being part of the everyday requirements of statistical analysis and also a 
convenient representation of some model choice problems (which we will discuss later). 

Testing point null hypotheses actually requires a modification of the prior distribution so 
that, when testing Hq : 8 G 8o versus Hi : G Oi, 

7r(6 ) > and tt(0i) > 

hold, whatever the measures of ©o and Oi for the original prior, which means that the prior 
must be decomposed as 

n{8) = P*(8 G 9 ) x 770(61) + P 7r (6> G 9i) x m(8) 

with positive weights on both ©o and ©i. 

Note that this modification makes sense from both informational and operational points of 
view. If Hq : 8 = 8q, the fact that the hypothesis is tested implies that 8 = 8q is a possibility 
and it brings some additional prior information on the parameter 8. Besides, if Hq is tested 
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Table 1 Posterior probability of /i = for different values 
of 2 = x/a, p — 1/2, and for r = a (top), r 2 = 10a 2 
(bottom). 



z 





0.68 


1.28 


1.96 


7r(M = 0|z) 


0.586 


0.557 


0.484 


0.351 


n(p = 0\z) 


0.768 


0.729 


0.612 


0.366 



Source : \Marin and Ro bert 2007. 



and accepted, this means that, in most situations, the (reduced) model under Hq will be 
used rather than the (full) model considered before. Thus, a prior distribution under the 
reduced model must be available for potential later inference. (Formaly, the fact that this 
later inference depends on the selection of Hq should also be taken into account.) 

In the special case 8o = {9q}, 1S the Dirac mass at 9q, which simply means that 
P"° (9 = 9q) = 1, and we need to introduce a separate prior weight of Hq, namely, 

p = P 7! (9 = 9 ) and tt(0) = ple o (0) + (1 - p)in(0) . 



Then, 



n(Q \x) 



ft>o ( X )P 



fe (x)p 



f fe{x)ir{9)d9 fe (x)p+(l-p) mi (xy 

In the case when x ~ Af([i>, u 2 ) and p ~ A/"(^, t 2 ), consider the test of Hq : p — 0. We 
can choose £ equal to if we do not have additional prior information. Then the Bayes factor 
is the ratio of marginals under both hypotheses, p = and p, ^ 0, 



E)7r 



and 



n(p = 0|x) 



mi(x) a 
fo(x) Ver 2 + t 2 



e -x 2 /2a 2 



1 



1-P 
P 



■ exp 



2a 2 (a 2 



is the posterior probability of Hq. Table Q] gives an indication of the values of the posterior 
probability when the normalized quantity x/a varies. This posterior probability again 
depends on the choice of the prior variance r 2 : The dependence is actually quite severe, 
as shown below with the Jeffreys-Lindley paradox. 



3.4 The Ban on Improper Priors 

Unfortunately, this decomposition of the prior distribution into two subpriors brings a serious 
difficulty related to improper priors, which amounts in practice to banning their use in testing 
situations. In fact, when using the representation 

tt(0) = P*(0 E Go) x 7T O (0) + P*(0 e GO x m(9) , 

the weights P*(9 £ Go) and P^{9 £ Gi) are meaningful only if ttq and m are normalized 
probability densities. Otherwise, they cannot be interpreted as weights. 
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Figure 4 Range of the B ayes factor B^p when r goes from 10 4 to 10. (Note: The a>axis is in 
logarithmic scale.) (Source: Marin and Robert 200%) 



Table 2 Posterior probability of Ho '■ fi = for the Jeffreys 
prior Tri(fi) — 1 under Hi. 



X 


0.0 


1.0 


1.65 


1.96 


2.58 


7r(/x = 0|rr) 


0.285 


0.195 


0.089 


0.055 


0.014 



Source: \Marin and RoberklOO'X 



In the instance when x ~ A/"(/Lt, 1) and Hq : = 0, the improper (Jeffreys) prior is 
7Ti(/z) = 1; if we write 

t(m) = oIo(M) 



1 



then the posterior probability is 

7r(/i = 0|x) = — 



-x 2 /2 



J — oo 

A first consequence of this choice is that the posterior probability of Hq is bounded from 
above by 

tt(m = 0» < 1/(1 + V^7r) = 0.285 . 

Table |2]provides the evolution of this probability as x goes away from 0. An interesting point 
is that the numerical va lues somehow coincide with the p-values used in classical testing 
dCasella and BergeifcOOll) . 
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If we are instead testing Hq : 9 < versus Hi : 9 > 0, then the posterior probability is 



and the answer is now exactly the p-value found in classical statistics. 

The difficulty in using an improper prior also relates to what is called the Jeffreys-Lindley 
paradox, a phenomenon that shows that limiting arguments are not valid in testing settings. 
In contrast with estimation settings, the noninformative prior no longer corresponds to the 
limit of conjugate inferences. In fact, for a conjugate prior, the posterior probability 



converges to 1 when r goes to +oo, for every value of x, as already illustrated by 
Figure [4] This noninformative procedure differs from the noninformative answer [1 + 
V / 27rexp(x 2 /2)]~ 1 above. 

The fundamental issue that bars us from using improper priors on one or both of the sets 
Go and 0i is a normalizing difficulty: If go and gi are measures (rather than probabilities) 
on the subspaces 0o and 0i, the choice of the normalizing constants influences the Bayes 
factor. Indeed, when <?; is replaced by cigi (i = 0, 1), where ci is an arbitrary constant, the 
Bayes factor is multiplied by Co /c\ . Thus, for instance, if the Jeffreys prior is flat and go = cq, 
gx = c\, the posterior probability 



is completely determined by the choice of cq/c\. This implies, for instance, that the function 
[1 + \phi exp(x 2 /2)] _1 obtained earlier has no validity whatsoever. 

Since improper priors are an essential part of the Bayesian approach, there have been many 
proposals to overcome this ban. Most use a device that transforms the prior into a proper 
probability distribution by using a portion of the data @ n and then use the other part of the 
data to run the test as in a standard situation. The variety of available solutions is due to the 
many possibilities of removing the dependence on the choice of the portion of the data used 
in the first step. The resulting procedures are called pseudo-Bay es factors . See Robert (2007, 
Chapter 5) for more details. 

3.5 The case of nuisance parameters 

In some settings, some parameters are shared by both hypotheses (or by both models) that 
are under comparison. Since they have the same meaning in each of both models, the above 
ban can be partly lifted and a common improper prior can be used on these parameters, in 
both models. 

For instance, consider a regression model, represented as 





tt(0 G O \x) 



PoCoL fe{x)d9 



Poc Je fe(x) d9 + (1 - po)a f @i f e {x) AO 



y|x,/3,o- 



M(X(3,a 2 I n ), 



(8) 
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where X denotes the (n,p) matrix of regressors — upon which the whole analysis is 
conditioned — , y the vector of the n observations, and /3 is the vector of the regression 
coefficients. (This is a matrix representation of the repeated observation of 

y { = PxXii + . . . + (ipx ip + oti ti ~ 7V(0, 1) , 

when i varies from 1 to n.) Variable selection in this setup means removing covariates, that 
is, columns of X, that are not significantly contributing to the expectation of y given X. In 
other words, this is about testing whether or not a null hypothesis like Hq : /3\ = holds. 
From a Bayesian perspective, a possible non inform ative prior distribution on the generic 
regression model ([8} is the so-called Zellner's (1986) g-prior, where the conditional 8 7r(/3|<r) 
prior density corresponds to a normal 

7V(0,ncr 2 (X T X)- 1 ) 

distribution on (3, A T denoting the transposed matrix associated with A, and where a 
"marginal" improper prior on a 2 , 7t(ct 2 ) = cr~ 2 , is used to complete the joint distribution. 
With this default (or reference) prior modelling, and when considering the submodel 
corresponding to the null hypothesis Hq : /3i = 0, with parameters (i^ 1 ^ and a, we can use 
a similar g-prior distribution 

/3 ( - 1} |a,X- A/XCW^XljX-!)- 1 ), 

where X_i denotes the regression matrix missing the column corresponding to the first 
regressor, and a 2 ~ tt(<t 2 ) = o~ 2 . Since a is a nuisance parameter in this case, we may 
use the improper prior on a 2 as common to all submodels and thus avoid the indeterminacy 
in the normalising factor of the prior when computing the Bayes factor 

= J /(y|/3-i, a, XM/?^ 1 ) \a, X_i)dg-i a' 2 da 
01 J/(y|/3,a,XV(/3|a,X)d/3a- 2 da 

Figure [5] reproduces a computer output from iMarin and Robert (2007) that illustrates how 



this default prior and the corresponding Bayes factors can be used in the same spirit as 
significance levels in a standard regression model, each Bayes factor being associated with 
the test of the nullity of the corresponding regression coefficient. For instance, only the 
intercept and the coefficients of Xi, X2, X4, X$ are significant. This output mimics the 
standard lm R function outcome in order to show that the level of information provided by 
the Bayesian analysis goes beyond the classical output. (We stress that all items in the table of 
Figure|5]are obtained via closed-form formulae.) Obviously, this reproduction of a frequentist 
output is not the whole purpose of a Bayesian data anlysis, quite the opposite: it simply 
reflects on the ability of a Bayesian analysis to produce automated summaries, just as in the 
classical case, but the inferencial abilities of the Bayesian approach are considerably wider. 
(For instance , testing simultaneously the nullity of ^3, /3q, . . . , /3io is of identical difficulty, 



as detailed in Marin and Rober3|2007] Chapter 3.) 



The fact that the prior distribution depends on the matrix of regressors X is not contradictory with the Bayesian 
paradigm in that the whole analysis is conditional on X. The potential randomness of the regressors is not accounted 
for in this analysis. 
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Estimate 


BF 


log10(BF) 


(Intercept) 


9.2714 


26.334 


1.4205 (***) 


X1 


-0.0037 


7.0839 


0.8502 (**) 


X2 


-0.0454 


3.6850 


0.5664 (**) 


X3 


0.0573 


0.4356 


-0.3609 


X4 


-1.0905 


2.8314 


0.4520 (*) 


X5 


0.1953 


2.5157 


0.4007 (*) 


X6 


-0.3008 


0.3621 


-0.4412 


X7 


-0.2002 


0.3627 


-0.4404 


X8 


0.1526 


0.4589 


-0.3383 


X9 


-1.0835 


0.9069 


-0.0424 


X10 


-0.3651 


0.4132 


-0.3838 



evidence against HO: (****) decisive, (***) strong, (**) substantial, (*) poor 

Figure 5 R output o f a Bayesian regression a nalysis on a processionary caterpillar dataset with ten 
covariates analysed in lMarin and Roberll ( l2007h . The Bayes factor on each row corresponds to the test 
of the nullity of the corresponding regression coefficient. 



4 Extensions 

The above description of inference is only an introduction and is thus not representative of 
the wealth of possible applic ations resulting from a B ayesian modelling. We consider below 
two extensions inspired from Marin and Roberll ( 2007 ). 



4. 1 Prediction 

When considering a sample 3> n — [x\, . . . , x n ) from a given distribution, there can be 
a sequential or dynamic structure in the model that implies that future observations are 
expected. While more realistic modeling may involve probabilistic dependence between the 
Xi's, we consider here the simpler setup of predictive distributions in iid settings. 

If x n +i is a future observation from the same distribution fg(-) as the sample @ n , its 
predictive distribution given the current sample is defined as 

r(x n+1 \® n )= J f{x n+1 \6,® n )n{6\$) n )de = J fo(x n+1 )Tr(6\@ n )d6. 

The motivation for defining this distribution is that the information available on the 
pair (x n +i,9) given the data Q> n is summarized in the joint posterior distribution 
fo(x n j r i)Tr{9\Si n ) and the predictive distribution above is simply the corresponding marginal 
on Xn+i- This is nonetheless coherent with the Bayesian approach, which then considers 
x n+ i as an extra unknown. 

For the normal ,yV(^L : a 2 ) setup, using a conjugate prior on (/i, a 2 ) of the form 

(a 2 )- A "- 3/2 exp-{A>-0 2 + "}/2a 2 , 
the corresponding posterior distribution on (/i, a 2 ) given 3> n is 



2 i I— t\2 

s x + - — —{x - n 
x Xn+ n . 



/2 
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denoted by 



Jf (a®n),o 2 /\^n)) x J^Sf (A CT (^„),a(^„)/2) , 



and the predictive on x n+ i is derived as 



r( Xn+1 \@ n ) CX J ( (T 2 ) -A.-2-n/2 eX p -(z n+1 - M ) 2 /2<7 2 

x exp- {A„(0 n )(/i - £(f?„)) 2 + a(^„)} /2a 2 d(^,a 2 ) 
cx | ( (T 2 ) -A.-„/2-3/2 cxp _ { {X ^ n) + l){Xn+l _ e( ^ n)) 2 

/A M (5y + a(^„)}/2<7 2 da 2 




Therefore, the predictive of x n+ i given the sample is a Student's t distribution with 
mean and 2X a + n degrees of freedom. In the special case of the noninformative 

prior, A p = \ a = a = and the predictive is 



This is again a Student's t distribution with mean x n ), scale s x j^fn, and n degrees of 
freedom. 

4.2 Outliers 

Since normal modeling is often an approximation to the "real thing," there may be doubts 
about its adequacy. As already mentioned above, we will deal later with the problem of 
checking that the normal distribution is appropriate for the whole dataset. Here, we consider 
the somehow simpler problem of assessing whether or not each point in the dataset is 
compatible with normality. There are many different ways of dealing with this problem. 
We choose here to take advantage of the derivation of the predictive distribution above: If an 
observation Xi is unlikely under the predictive distribution based on the other observations, 
then we can argue against its distribution being equal to the distribution of the other 
observations. 

For each Xi <G 2# n , we consider f^{x\'3' l n ) as being the predictive distribution based 
on = (xi, . . . ,Xi-i,Xi + i, . . . ,x n ). Considering f?(xi\<&$) or the corresponding cdf 
F?{xi\$l l v ) (in dimension one) gives an indication of the level of compatibility of the 
observation with the sample. To quantify this level, we can, for instance, approximate the 
distribution of FT{x^9f£) as uniform over [0, 1] since F^(-\^) converges to the true cdf of 
the model. Simultaneously checking all F^(xi\^) over i may signal outliers. 

The detection of outliers must pay attention to the Bonferroni fallacy, which is that extreme 
values do occur in large enough samples. This means that, as n increases, we will see smaller 
and smaller values of F? (xi 1 2> l n ) even if the whole sample is from the same distribution. The 
significance level must therefore be chosen in accordance with this observation, for instance 



n 



-, -(n+l)/2 



f*{x n+l \3i n )<x s 2 + 



n+ 1 



( ^n) 
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using a bound a on F^(xi\^) such that 

l-(l-a) n = l-a, 
where a is the nominal level chosen for outlier detection. 

4.3 Model choice 

For model choice, i.e. when several models are under comparison for the same observation 

M l : x ~ fi(x\0i) , i€J, 

where 3 can be finite or infinite, the usual Bayesian answer is similar to the Bayesian 
tests as described above. The most coherent perspective (from our viewpoint) is actually 
to envision the tests of hypotheses as particular cases of model choices, rather than trying 
to justify the modification of the prior distribution criticised by iGelmanl d2008l) . This also 
inco rporates within model cho ice the alternative solution of model averaging, proposed 



bv iMadigan an d Rafterv (1994), which strives to keep all possible models when drawing 
inference. 

The idea behind Bayesian model choice is to construct an overall probability on the 
collection of models UigjSti in the following way: the parameter is = i.e. the 

model index and given the model index equal to i, the parameter 6>; in model 3Jtj, then the 
prior measure on the parameter 8 is expressed as 

d7r(0) = ^PidKitfi), J2pi = l- 
ie3 ieii 

As a consequence, the Bayesian model selection associated with the 0-1 loss function and 
the above prior is the model that maximises the posterior probability 



7r(9Ki|ar) 



Eft / fMOfriWiWi 

across all models. Contrary to classical pluggin likelihoods, the marginal likelihoods involved 
in the above ratio do compare on the same scale and do not require the models to be nested. As 
mentioned in Section [331 integrating out the parameters Oi in each of the models takes into 
account their uncertainty thus the marginal likelihoods J Q fi(x\9i)TTi(8i)d9i are naturally 
penalised likelihoods. In most parametric setups, when the number of parameters does not 
grow to infinity with the number of observations and when those parameters are identifiable, 
the Bayesian model selector as defined above is consistent, i.e. with increasing numbers of 
observations, the probability of choosing the right model goes to 1. 
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